1. INTRODUCTION

Estimating the depth of an object from its pose in an environment is an essential part of computer
vision, but it is difficult to do with a monocular camera. A monocular camera acquires only 2D information
about an object in a scene because the perspective transformation discards depth information [1], [2].
Recovering the depth to obtain complete 3D information about the object’s pose is therefore useful in many
robotic applications such as pose estimation, picking and placing, and mapping. Traditional methods based on
Bluetooth, laser, ultrasonic, and IR sensors have been used in the past to estimate an object’s distance [3]-[5],
but with the advent of vision sensors, stereo vision and monocular vision have become the two predominant
methods for estimating the object’s distance in image-based visual servoing. Stereo vision, also known as the
computer-based passive approach, uses two cameras arranged in a binocular structure, like human eyes, to
estimate the depth of the object [6]-[8]. This is achieved by placing the two cameras horizontally apart, at
equal distances from their center point, so that each captures a 2D image of the object in its view [9]. Because
of the distance separating the two cameras, the two captured images show a disparity, which is used to compute
the depth information at the point where the fields of view of the two cameras intersect. The stereo vision
method is highly accurate but requires a large number of images to be processed to achieve precision, and the
resulting complex computations make it computationally intensive. It is also expensive to implement because
it requires two cameras. In contrast, the monocular vision method uses a single camera to estimate the object’s
distance based on reference points in the camera’s field of view [10]. This method is fairly accurate and not
computationally intensive because it requires only a few image registrations, which lets the computer process
the images faster. It can therefore reduce the system workload and shorten the processing time [11]. The
monocular method used for visual servoing is also cheap and simple to handle because it uses only one camera.
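
The stereo principle described in this section can be illustrated with the standard relation for a rectified, parallel-axis camera pair, Z = f·B/d. The sketch below is only an illustration under that assumption; the function name, parameter names, and numeric values are hypothetical and are not taken from the cited works.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth Z = f * B / d for a rectified, parallel-axis stereo pair.

    focal_px     : focal length in pixels (assumed known from calibration)
    baseline_m   : horizontal separation of the two cameras in metres
    disparity_px : horizontal pixel shift of the same point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical values chosen only to show the call.
print(depth_from_disparity(focal_px=700.0, baseline_m=0.1, disparity_px=35.0))
```

The relation also shows why a single camera loses depth: without a second view there is no disparity to invert.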


2. RELATED WORK 

Object distance measurement plays a vital role in acquiring the depth information that complements
the classic 2D visual perception used in robotic and autonomous systems applications. A brief review of the
literature on distance estimation is presented in this section. Zhou et al. [12] used a monocular vision method
to find the position and orientation of an object at a distance of 5 m. The relative translation and rotation
values in the X, Y, and Z directions were obtained through an unconstrained linear equation of the rotation
and translation matrices R and T and were computed using the inverse least-square method. Krishnan et al. [13]
proposed a complex log mapping method to measure the distance between the camera and the surface of an
object with an arbitrary pattern. The method uses two images taken at two known camera positions while the
camera moves along its optical axis. The distance from the object to the camera is then estimated from the
ratio between the sizes of the object projected on the two images.
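
The size-ratio idea can be sketched with a plain pinhole model, in which the projected size is inversely proportional to distance. This is a simplified illustration, not the complex log mapping of [13]; the function and parameter names are hypothetical.

```python
def distance_from_size_ratio(size_far_px: float, size_near_px: float, displacement: float) -> float:
    """Distance from the first (farther) camera position to the object.

    Pinhole assumption: projected size = f * S / Z. Moving the camera a known
    `displacement` toward the object along the optical axis gives
    size_near / size_far = Z / (Z - displacement), which is solved for Z.
    """
    ratio = size_near_px / size_far_px
    if ratio <= 1.0:
        raise ValueError("the object must appear larger after moving closer")
    return displacement * ratio / (ratio - 1.0)


# Hypothetical example: object 2.0 m away, camera advanced by 0.5 m.
size_far = 500.0 * 0.1 / 2.0    # f = 500 px, S = 0.1 m, Z = 2.0 m
size_near = 500.0 * 0.1 / 1.5   # same object seen from 1.5 m
print(distance_from_size_ratio(size_far, size_near, 0.5))
```

Neither the focal length nor the true object size is needed, since both cancel in the ratio.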

Chang et al. [14] proposed an efficient neural network method for achieving self-localization by a
humanoid robot. Yang and Cao [15] proposed a 6D object pose estimation method that uses the
Levenberg-Marquardt algorithm to refine the result of the decomposed homography matrix. Zhang et al. [16]
proposed a method of estimating the localization of an object that is based on perspective transformation.
Their method was presented in three stages. The first stage calibrated the camera to obtain its intrinsic
parameters. The second stage built a model for computing the object’s distance through perspective
transformation by mapping the 3D points in the real world to the 2D image of a pinhole camera. The third
stage, the measurement of the absolute distance between the camera and the target object, was achieved
through the geometry formed from the perspective projections.
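
The perspective mapping of a 3D point onto a pinhole image, and its inversion for an object of known size, can be sketched as follows. This is a generic pinhole model with assumed intrinsics (fx, fy, cx, cy); it is not the specific model of [16], and all names and values are hypothetical.

```python
def project_point(point_xyz, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates onto the image plane (pixels)."""
    x, y, z = point_xyz
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return fx * x / z + cx, fy * y / z + cy


def distance_from_known_height(focal_px, real_height_m, image_height_px):
    """Invert the projection for an object of known size: Z = f * H / h."""
    return focal_px * real_height_m / image_height_px


# Hypothetical intrinsics and point, used only to show the calls.
u, v = project_point((0.1, 0.05, 2.0), fx=800.0, fy=800.0, cx=320.0, cy=240.0)
print(u, v)
print(distance_from_known_height(800.0, 0.06, 24.0))
```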

Muslikhin et al. [17] used a machine learning algorithm to classify the positions of the object in the
image of the mono camera and then used the k-nearest neighbors (k-NN) approach to find the centroid point
nearest to the closest class. Bui et al. [18] proposed using a single camera with a triangulation method to
measure the distance of an object indirectly: the distance is determined from one known angle and two sides
of a triangle. Zheng et al. [19] presented a method of measuring an object’s distance with a monocular vision
camera on a mobile robot, in which the distance between the mobile robot and the target object was
determined using sub-pixel image processing, mapping, and path planning. Zhu and Fang [20] addressed the
distance estimation problem with a deep-learning-based method that directly predicts the distance of a given
object in red, green, and blue (RGB) images without using the intrinsic parameters of the camera. They further
enhanced the model with a keypoint regressor, in which a projection loss was defined to improve the distance
estimates of objects close to the monocular camera, and facilitated the training and evaluation tasks with
extended KITTI and nuScenes (mini) datasets annotated with the distances of specified objects.

Vajgl et al. [21] presented Dist-YOLO, a method based on the YOLO architecture in which the
original loss function is updated to estimate the absolute distance of an object using the information from a
monocular camera. Most of the methods used for estimating the object’s distance in the literature are
computationally intensive. In this paper, a monovision camera is used to obtain a set of image-based data
together with the measured distances of the object, and a curve-fitting technique is applied to these data to
derive a non-linear function for estimating the object’s distance.


3. METHOD 

To determine the distance of the object from the camera, that is, the depth information, a single
Pixy2 camera was used in this study. The Pixy2 camera is a vision sensor with an embedded image processor
that processes captured RGB images and segments them to recognize objects of different colors using its
built-in color-based filtering algorithm, called color-connected components (CCC). It can track up to seven
different colors, namely red, blue, green, yellow, orange, cyan, and violet, and it also tracks the object’s
position in the image in two dimensions. The front and back views of the Pixy2 camera are shown in Figure 1.

Although the Pixy2 camera can perform other functions such as line tracking and barcode reading [22],
in this study it was trained on a specific single-colored object positioned at successive distances from the
camera to acquire a dataset for determining the object’s distance.

Figure 1. The Pixy2 camera


3.1. Camera set-up 

To train the Pixy2 camera to acquire the visual information of an object in its field of view, the
vision sensor must be installed in a position where the target object is visible to the camera, so that occlusion
is avoided. The eye-in-hand configuration was therefore used in this paper. In the eye-in-hand configuration,
the camera is mounted on the manipulator, either before or after the wrist of the robotic arm [23], [24].
Figure 2 shows the Pixy2 camera mounted on the robot manipulator, which is used for pick-and-place tasks.




Figure 2. Pixy2 camera mounted on the elbow joint of the manipulator 


3.2. Distance measurement using a single Pixy2 camera 

To measure the distance of the object using a single Pixy2 camera, a set of training data for
estimating the object’s distance was first generated experimentally. A ripe, completely red tomato was used as
the target object and was trained to be recognized by the Pixy2 camera using its PixyMon software. The ripe
tomato was positioned at horizontal distances between 430 and 580 mm in front of the robotic arm in the real
world and at vertical positions between 0 and 207 mm of the camera’s image height. These horizontal and
vertical ranges were based on the manipulator’s length (580 mm) and the entire image height (207 mm) of the
camera. The object (ripe tomato) was placed sequentially in the camera’s field of view (FOV) as shown in
Figure 3.

For each position of the ripe tomato in the camera’s FOV, the distance of the ripe tomato from the
camera’s lens was measured using a measuring tape with an accuracy of ±0.5 mm. To generate the training
data, the measured distances were recorded alongside the image data generated by the Pixy2 camera. The
image data consist of the two coordinates (x, y) and the width and height of the ripe tomato, from which the
area of the bounding box is determined, as shown in Figure 4. These values were estimated by the Pixy2
camera through its image processing and object tracking capabilities as the ripe tomato was placed
sequentially within the specified horizontal and vertical ranges. The training data obtained from the
experiment by placing the ripe tomato in sequential positions relative to the camera’s reference position are
given in Table 1.



Figure 4. A captured ripe tomato bounded by a box in the image to obtain the trained image data for 
the computation of the object’s distance 


Table 1. Data obtained from training the Pixy2 camera to estimate the positions of the ripe tomato when 
placed sequentially in the camera’s field of view 


Trial   X (mm)   Y (mm)   Width (mm)   Height (mm)   Area (mm²)   Actual distance (mm)
1       274      97       24           22            528          430
2       216      115      24           19            456          438
3       189      125      22           20            440          463
4       168      134      20           19            380          490
5       140      142      20           18            360          502
6       121      151      18           18            324          530
7       107      166      18           17            306          545
8       64       182      16           17            272          550
9       28       198      16           12            192          575
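
The area column of Table 1 is the product of the bounding-box width and height. A minimal sketch of how the training samples could be held in code is given below; the class and field names are illustrative assumptions and are not part of the Pixy2 API, and the values are copied from Table 1 in the units reported there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockSample:
    """One training sample: bounding-box data plus the measured distance."""
    x: int
    y: int
    width: int
    height: int
    actual_distance: int

    @property
    def area(self) -> int:
        # Bounding-box area used as the predictor of distance.
        return self.width * self.height


TABLE_1 = [
    BlockSample(274, 97, 24, 22, 430),
    BlockSample(216, 115, 24, 19, 438),
    BlockSample(189, 125, 22, 20, 463),
    BlockSample(168, 134, 20, 19, 490),
    BlockSample(140, 142, 20, 18, 502),
    BlockSample(121, 151, 18, 18, 530),
    BlockSample(107, 166, 18, 17, 545),
    BlockSample(64, 182, 16, 17, 550),
    BlockSample(28, 198, 16, 12, 575),
]

areas = [sample.area for sample in TABLE_1]
print(areas)
```
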

To determine the object’s distance, which is the z-coordinate of the ripe tomato irrespective of its
pose in the camera’s FOV, the least-square method, which finds the best-fit curve for a given dataset by
minimizing the sum of squared deviations [25], was employed to obtain the relationship between the area of
the bounding box and the actual distance in the training data of Table 1. The curve-fitting plot in Figure 5
shows a non-linear relationship between the actual distance and the area of the bounding box.

Figure 5. The graph of the actual distance against the area of the bounding box


The non-linear function obtained from the graph is presented in (1),

y = 3285.4 x^{-0.322}    (1)

where y is the actual distance and x is the area of the bounding box. Hence, the distance is as in (2).

Distance = 3285.4 (Area)^{-0.322}    (2)


The relationship in (2) between the actual distance and the area of the bounding box was used to estimate
the distance of the object from the camera.
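
A power-law relationship of the form y = a·x^b can be fitted by ordinary least squares on the logarithms of the data, since ln y = ln a + b·ln x is linear. The sketch below applies this to the areas and distances of Table 1. It assumes the fit was done in log-log space, which the paper does not state explicitly; the function name is illustrative.

```python
import math

# Bounding-box areas and measured distances copied from Table 1.
AREAS = [528, 456, 440, 380, 360, 324, 306, 272, 192]
DISTANCES = [430, 438, 463, 490, 502, 530, 545, 550, 575]


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (ln x, ln y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                      # slope in log-log space is the exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log space gives the coefficient
    return a, b


a, b = fit_power_law(AREAS, DISTANCES)
print(f"Distance ~= {a:.1f} * Area^({b:.3f})")
```

On the Table 1 data this fit gives a coefficient close to 3285 and an exponent close to -0.32. The negative exponent reflects that the bounding box shrinks as the object moves away from the camera.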


4. RESULTS AND DISCUSSION 

To estimate the object’s distance, the distance-area relationship in (2) was applied to the
bounding-box areas in Table 1 to estimate the distance of the ripe tomato from the Pixy2 camera. The result
was validated by computing the error, that is, the difference between the actual distance and the estimated
distance, for each trial and averaging it over all trials. Table 2 shows that the individual errors range from
-29 mm to 25 mm and that their signed average is 1.33 mm. The estimated and actual distances are also
compared graphically in Figure 6.


Table 2. Result of the estimated distance and the average error 


Trial   Area (mm²)   Actual distance (mm)   Estimated distance (mm)   Error (mm)
1       528          430                    436                       -6
2       456          438                    458                       -20
3       440          463                    463                       0
4       380          490                    485                       5
5       360          502                    494                       8
6       324          530                    511                       19
7       306          545                    520                       25
8       272          550                    540                       10
9       192          575                    604                       -29

Average error                                                         1.33
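
The validation in Table 2 can be reproduced with a short script that evaluates the distance-area relationship for each trial and averages the errors. The coefficient 3285.4 is taken from (2); the exponent -0.322 is an assumption here, chosen because it reproduces the estimated distances listed in Table 2. The mean absolute error printed at the end is an extra diagnostic that the paper does not report.

```python
COEFF = 3285.4      # coefficient from equation (2)
EXPONENT = -0.322   # assumed exponent; it reproduces the Table 2 estimates

# Areas and actual distances copied from Table 2.
AREAS = [528, 456, 440, 380, 360, 324, 306, 272, 192]
ACTUAL = [430, 438, 463, 490, 502, 530, 545, 550, 575]


def estimate_distance(area: float) -> float:
    """Estimate the object distance from the bounding-box area."""
    return COEFF * area ** EXPONENT


estimated = [round(estimate_distance(area)) for area in AREAS]
errors = [act - est for act, est in zip(ACTUAL, estimated)]  # actual minus estimated

mean_signed_error = sum(errors) / len(errors)
mean_absolute_error = sum(abs(e) for e in errors) / len(errors)

for trial, (est, err) in enumerate(zip(estimated, errors), start=1):
    print(trial, est, err)
print(f"mean signed error:   {mean_signed_error:.2f}")
print(f"mean absolute error: {mean_absolute_error:.2f}")
```

Because positive and negative errors cancel in the signed average, the per-trial errors give a fuller picture of the fit than the average alone.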

Figure 6. Comparison of the estimated distance and actual distance of the object


5. CONCLUSION

The low-cost monovision camera and the least-square method used in this paper can estimate the
distance of the object from the camera irrespective of its pose in the camera’s field of view under varying light
conditions. The experimental result shows that the average error of the estimated object distance is 1.33 mm.
Since this method complements the 2D information used for determining the object’s location in Cartesian
space, it can be applied to many robotic and autonomous systems applications.